ABSTRACT

This report summarizes the research findings of a four year contract 
investigating the applicability of item response theory and tailored testing
to criterion-referenced measurement. Six major areas were studied on the 
project. These included: (a) techniques for forming unidimensional item 
sets, (b) techniques for calibrating items, (c) item parameter linking 
procedures, (d) comparisons of latent trait models, (e) tailored testing
procedures, and (f) decision making procedures. The results showed that 
factor analytic procedures were best at forming unidimensional item pools,
the LOGIST calibration program performed slightly better than the ANCILLES
program for item calibration, the maximum likelihood procedure using the
LOGIST program generally gave the best linking, the three-parameter logistic
model was preferred to the one-parameter model for tailored testing
applications, the maximum likelihood based tailored testing procedure was
slightly preferred to Owen's Bayesian procedure, and the use of the
sequential probability ratio test with tailored testing resulted in substantial
savings in test length. Overall, tailored testing was shown to be feasible 
for achievement testing applications. More detailed results are described 
in the papers and reports listed in this report. 
FINAL REPORT:

PROCEDURES FOR CRITERION-REFERENCED TAILORED TESTING



The purpose of this contract has been to investigate the applicability 
of item response theory (IRT) and tailored testing to criterion-referenced 
measurement. Since criterion-referenced measurement involves creating an 
item domain, setting criterion cutoffs, and making a decision as to the 
location of an examinee relative to the cutoff, the applicability of item
response theory and tailored testing had to be evaluated for each of these 
components. 

The investigation of the first component, creating an item domain, in- 
volved evaluating procedures for forming unidimensional item sets, proce- 
dures for item calibration and procedures for linking calibrations together 
to form large item pools. This was done since IRT requires an assumption of 
unidimensionality, and large item pools are required for tailored testing.
The setting of criterion cutoffs and the decision making aspect of criterion- 
referenced testing required the investigation of the IRT/tailored testing 
approach to achievement testing and to decision making. Therefore, the 
various IRT and tailored testing models were evaluated for achievement 
testing applications and a decision making procedure was developed for 
tailored testing applications. Each of these components will now be des- 
cribed in detail and research results will be summarized. 

Formation of Unidimensional Item Sets 



Because of the assumption that the items used in an IRT based procedure 
can be described in a unidimensional latent space, it was considered an im- 
portant component of this project to evaluate procedures for selecting test 
items to meet this assumption. Therefore, a study was planned to evaluate
the ability of various procedures to determine the dimensionality of a set 
of test items and to sort items into sets measuring a single dimension. The 
procedures evaluated included factor analysis, nonmetric multidimensional 
scaling, cluster analysis, and latent trait theory analysis. These
procedures were applied to simulated and real test data of varying factorial
complexity. In all cases, guessing was present in the data since
multiple-choice items were assumed for the tailored testing application.

The results of this study are reported in: 
Reckase, M. D. The formation of homogeneous item sets when guessing is a
factor in item response (Research Report 81-5). Columbia, MO: University
of Missouri, August 1981.

and in:

Reckase, M. D. Guessing and dimensionality: The search for a unidimensional 
latent space. Paper presented at the meeting of the American Educational 
Research Association, Los Angeles, April 1981. 

Reckase, M. D. The effect of guessing in dichotomously scored items on the 
operation of multivariate data reduction techniques. Paper presented at 
the meeting of the Psychometric Society, Iowa City, IA, May 1980.

The results indicated that factor analytic and nonmetric multidimensional 
scaling techniques could be used to sort items into unidimensional sets and 
that cluster analysis and latent trait analysis were generally not appropriate. 
Of the factor analytic techniques evaluated, principal factor analysis of 
phi coefficients was found to give the best information for determining the
dimensionality of a set of items. Nonmetric multidimensional scaling was
found to work well with some similarity coefficients while giving fairly
meaningless results with others. The MDSCAL program (Kruskal, 1964) used
gave the best results with Yule's Y coefficient, the phi coefficient, and
tetrachoric correlations.
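
As an illustration of the similarity measure that worked best, the sketch below computes phi coefficients between dichotomously scored items. It is not the project's code: the function names and the list-of-rows data layout are assumptions made for this example, and the resulting matrix would still have to be passed to a separate factor analysis routine.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two 0/1 scored item response vectors."""
    if len(x) != len(y):
        raise ValueError("response vectors must have equal length")
    # Cell counts of the 2 x 2 table of joint responses.
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("an item with no response variance has no phi coefficient")
    return (n11 * n00 - n10 * n01) / denom


def phi_matrix(responses):
    """Inter-item phi matrix; `responses` is a list of examinee rows of 0/1 scores."""
    items = list(zip(*responses))  # transpose to one tuple per item
    k = len(items)
    return [[1.0 if i == j else phi_coefficient(items[i], items[j])
             for j in range(k)] for i in range(k)]
```

Each row of the input holds one examinee's scored responses, so `phi_matrix` returns a square matrix with one row and column per item.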

The cluster analysis procedure was found to be inappropriate because of 
difficulty in determining how many clusters should be present in the data.
Both hierarchical and complete link procedures were evaluated. The latent
trait procedures worked fairly well, but were too cumbersome for general use. 
The procedure involved successive applications of LOGIST (Wood, Wingersky, 
and Lord, 1976) to a set of test items, deleting the items with low
a-values (discrimination estimates) after each run. Unidimensional subsets
were usually formed in this way, but as many as ten program runs were
required for each item set. This was
clearly impractical, especially considering the cost of the program runs. 

Item Calibration 

Once it had been determined that a set of items met the assumption of 
a unidimensional latent space, the items needed to be calibrated according 
to one of the IRT models. That is, the parameters of the model needed to
be estimated using one of the available computer programs for that purpose. 
There are several IRT models that could be used for tailored testing appli- 
cations and each of these models has several calibration programs for use in 
estimating its parameters. One of the initial tasks of this contract was to
compare the various models available and to evaluate their calibration
programs.

The results of a review of the literature and a comparison of item 
calibration models are presented in:

Reckase, M. D. Ability estimation and item calibration using the one and
three parameter logistic models: A comparative study (Research Report
77-1). Columbia, MO: University of Missouri, November 1977.

and 

McKinley, R. L. & Reckase, M. D. A comparison of the ANCILLES and LOGIST
parameter estimation procedures for the three-parameter logistic model
using goodness of fit as a criterion (Research Report 80-2). Columbia,
MO: University of Missouri, December 1980.

Other papers related to this topic are: 

Reckase, M. D. Unifactor latent trait models applied to multifactor tests: 
Results and implications. Journal of Educational Measurement, 1979,
4(3), 207-230.

McKinley, R. L. & Reckase, M. D. The fit of ICC's based on two different
three-parameter logistic model parameter estimation procedures. Paper
presented at the meeting of the Psychometric Society, Chapel Hill, N.C.,
May 1981.

Reckase, M. D. The validity of latent trait models through the analysis of
fit and invariance. Paper presented at the meeting of the American
Educational Research Association, Los Angeles, April 1981.

Reckase, M. D. A comparison of the one- and three-parameter logistic models
for item calibration. Paper presented at the meeting of the American
Educational Research Association, Toronto, March 1978.

Reckase, M. D. Univariate latent trait models applied to multivariate measures.
Paper presented at the meeting of the Psychometric Society, Chapel Hill,
N.C., May 1977.

The results of the research on calibration techniques can be divided into
two parts: the comparison of calibration models, and the comparison of
calibration programs for a single model. The comparison of calibration models
concentrated on the one- and three-parameter logistic models. The results of 
the research on the calibration models showed that the three-parameter model 
fit empirical data better than the one-parameter model, but that the sample 
size required to use the three-parameter model was substantially larger than 
for the one-parameter model. Lack of fit was most prominent for the one- 
parameter model when guessing was a factor in item response. 
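
The lack of fit under guessing can be seen directly from the two response functions. The sketch below is illustrative only: the scaling constant D = 1.7 and the function names are assumptions, not taken from the report. It shows that the one-parameter curve falls toward zero for low-ability examinees, while the three-parameter curve levels off at the guessing parameter c.

```python
import math

D = 1.7  # scaling constant for the three-parameter form (assumed here)


def p_1pl(theta, b):
    """One-parameter (Rasch) logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability: discrimination a, difficulty b,
    lower asymptote (guessing) c."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# A five-option multiple choice item (c = 0.2) answered by a very
# low-ability examinee: the 1PL predicts almost no correct responses,
# the 3PL predicts roughly chance-level success.
low_ability = -4.0
print(p_1pl(low_ability, 0.0))
print(p_3pl(low_ability, 1.0, 0.0, 0.2))
```

When real examinees guess, observed proportions correct at low ability sit near the chance level, which only the second function can reproduce.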

The models were also found to yield ability estimates that measured 
different constellations of ability when items tapping several dimensions 
were used in a test. The one-parameter logistic based ability estimates were 
related to the sum of the components present in a test, while the three- 
parameter logistic based ability estimates were found to be mainly related 
to the single largest component in a test. This difference in ability esti- 
mates is due to differences in the weighting of the item responses for the 
two models. Unit weights are used for the one-parameter model, while the 
items are weighted by the item discrimination parameter estimates for the 
three-parameter model. Despite these differences, the ability estimates 
obtained from the models were found to be highly correlated for many tests 
composed of a fixed set of items. The controlling factor seemed to be the 
magnitude of the first principal component of the test. When the item
calibration results from multiple choice tests were to be used for tailored
testing purposes, the three-parameter logistic model was found to be superior
because of the better fit to empirical data. The ability to approximate
guessing effects was found to be especially important because of the error
that guessing induced in the one-parameter item parameter estimates.
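
The difference in weighting described above can be made explicit with the maximum likelihood equations for ability. The notation below (a_i, b_i, c_i for the item parameters, u_i for the 0/1 response, D for a scaling constant) is an assumption of this sketch; the report itself does not print the equations.

```latex
\[
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-D a_i (\theta - b_i)}}, \qquad
\frac{\partial \ln L}{\partial \theta}
  = D \sum_{i=1}^{n} a_i \,
    \frac{P_i(\theta) - c_i}{(1 - c_i)\, P_i(\theta)} \,
    \bigl(u_i - P_i(\theta)\bigr) = 0 .
\]
\[
\text{With } a_i = 1 \text{ and } c_i = 0: \qquad
\sum_{i=1}^{n} \bigl(u_i - P_i(\theta)\bigr) = 0 .
\]
```

In the second equation every response enters with unit weight, so the estimate depends only on the number-correct score. In the first, each response is multiplied by a term proportional to a_i, so items with low discrimination estimates contribute little to the ability estimate.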

Among the numerous available item calibration procedures, the ANCILLES
(Urry, 1978) and LOGIST (Wood, Wingersky & Lord, 1976) procedures were selected
for comparison on this project because of their wide usage by the testing
community. The results of the research showed that the ICC estimates from
the LOGIST program fit the empirical item data slightly better than those from
the ANCILLES program. For this reason, the LOGIST program was suggested for
item calibration for use with tailored testing procedures. 

In addition to comparing existing calibration models and programs, a new
estimation procedure was developed on the project by Robert Tsutakawa and
Steve Rigdon. This new procedure is described in detail in the report:

Rigdon, S. E. & Tsutakawa, R. K. Estimation in latent trait models (Research
Report 80-1). Columbia, MO: University of Missouri, May 1981.

The investigation focused on the estimation of ability and item parameters 
for a class of binary response models, including the commonly used logistic 
and probit models. Estimation procedures were examined for variations of these 
models depending on whether only the ability parameters or both ability and 
item parameters are assumed random with prior distributions with fixed but
unknown hyperparameters.

When the item parameters are fixed and the ability parameters are
random, the EM algorithm can be readily adapted. Estimates of ability 
parameters are easily found, even in situations where maximum likelihood 
estimates do not exist. Simulation studies have shown that these esti- 
mates are more efficient than maximum likelihood estimates in terms of the 
mean square error criterion. The EM algorithm can be modified to esti- 
mate the item parameters by conditioning on the expected ability para- 
meters. This revised method is computationally much cheaper while perfor- 
ming as well as the straight EM algorithm. Although most of the numerical 
work has been restricted to the one-parameter logistic (Rasch) model, 
with a normal prior, the method extends to multi-parameter models.
Computer programs for the two-parameter (for ability and guessing) logistic
model are now being developed.

When both item and ability parameters are considered random, the EM 
algorithm applies in principle but cannot be easily implemented since the 
random variables do not have distributions belonging to exponential families. 
The algorithm was modified by alternately estimating the ability and item 
parameters while holding one set fixed at its posterior expectation. Simu- 
lated results assuming normal priors indicated that the resulting estimators 
do not perform as well as the maximum likelihood estimator. This discrepancy
disappeared, however, when the prior distribution of the difficulty parameter
was assumed uniform.

The extent to which the models used here are applicable in practice 
remains to be seen. Some preliminary work was done on goodness of fit 
tests. Though it appears that the classical chi-square methods apply when
ability parameters are from a common prior distribution, the amount of com- 
putation needed for even a moderate number of items may be prohibitive. 
A more feasible approach may be through the logarithmic penalty function 
mentioned by Mosteller and Wallace (1964) in their book on the Federalist
Papers and examined in more detail by Efron (1978).

Two other areas included in the original research objectives were com- 
paring item response curves and designing sequential methods for mental 
testing. Work was limited since the procedures depend on the results from 
the EM algorithm study.



Item Parameter Linking

Once item parameter estimates had been obtained from a series of group
tests, they needed to be placed on the same scale (linked) so all of the
items could be used in the tailored testing item pool. Numerous procedures
have been developed for this linking task. A natural extension of the
research on models and calibration procedures was to evaluate linking
procedures to determine which gave the most accurate parameter estimates for
use with tailored testing. The results of the evaluation were reported in
the following report:

McKinley, R. L. & Reckase, M. D. A comparison of procedures for constructing
large item pools (Research Report). Columbia, MO: University of
Missouri, August 1981.

Some of the results of this research were also reported in:

Reckase, M. D. Item pool construction for use with latent trait models. Paper
presented at the meeting of the American Educational Research Association,
San Francisco, April 1979.

The basic design for this part of the research effort was to sample a
series of short tests from a long test, link the calibrations of the short 
tests, and then compare the linked parameter estimates to those obtained
from the long test. This procedure was used to evaluate linking procedures
for both the one-parameter and three-parameter models and to determine the
necessary sample size and number of common items between tests for linking
to be performed. The MAX calibration program (Wright & Panchapakesan, 1969)
was used for the one-parameter model and the ANCILLES and LOGIST programs
were used for the three-parameter model.

The results of the research showed that far fewer cases were required 
to link the single parameter of the one-parameter model than were needed
for the parameters of the three-parameter model. Approximately 2000 cases
seemed to be needed in the latter case. Of the four linking procedures
evaluated for the three-parameter logistic model, maximum likelihood linking
using the LOGIST program gave the best results overall when a sample of 2000
cases was used. Fifteen items in common between test forms were sufficient for
adequate linking. Future research should address the quality of items that 
should be in common between tests. 

Latent Trait Model 

Once an item pool has been produced using the calibration and linking 
methods described above, the actual tailored testing process can begin. To 
define that process, one of the many latent trait models must be selected 
as a basis for the tailored testing procedure. Of the many models available, 
the one- and three-parameter logistic models were evaluated for use in this 
project. A series of three tailored testing studies were run to determine 
which of these two models gave the most accurate ability estimates in a
realistic testing setting. The following reports describe these studies in
detail.

Koch, W. R. & Reckase, M. D. A live tailored testing comparison study of
the one- and three-parameter logistic models (Research Report 78-1).
Columbia, MO: University of Missouri, June 1978.

Koch, W. R. & Reckase, M. D. Problems in application of latent trait models
to tailored testing (Research Report 79-1). Columbia, MO: University
of Missouri, September 1979.

McKinley, R. L. & Reckase, M. D. A successful application of latent trait
theory to tailored achievement testing (Research Report 80-1). Columbia,
MO: University of Missouri, February 1980.

Other papers written on this topic were:

McKinley, R. L. & Reckase, M. D. Computer application to ability testing.
AEDS Journal, 1980, 13(3), 193-203.

Reckase, M. D. Procedures for computerized testing. Behavior Research
Methods and Instrumentation, 1977, 9(2), 148-152.

English, R. A., Reckase, M. D. & Patience, W. M. Application of tailored
testing to achievement measurement. Behavior Research Methods and
Instrumentation, 1977, 9(2), 158-161.

Reckase, M. D. Tailored testing, measurement problems, and latent trait
theory. Paper presented at the meeting of the National Council on
Measurement in Education, Los Angeles, April 1981.

Patience, W. M. & Reckase, M. D. Self-paced versus paced evaluation utilizing
computerized tailored testing. Paper presented at the meeting of the 
National Council on Measurement in Education, Toronto, March 1978. 

Reckase, M. D. Computerized achievement testing using the simple logistic
model. Paper presented at the meeting of the American Educational Research
Association, New York, April 1977.

The one- and three-parameter logistic based tailored testing procedures 
used in these studies were both based on maximum information item selection 
and maximum likelihood ability estimation. The criteria for evaluation of 
the ability estimates obtained from the procedures were the information func- 
tion and reliability coefficients obtained when the procedures were applied 
in a realistic setting. Both vocabulary and achievement items were used in 
the evaluation. The populations used for the studies were upper level college
students.
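
Maximum information item selection can be sketched as follows. This is a minimal illustration rather than the project's program: the item pool layout (a list of (a, b, c) tuples), the scaling constant D = 1.7, and the function names are all assumptions made for the example.

```python
import math

D = 1.7  # scaling constant (assumed here)


def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information at ability theta for the three-parameter logistic model."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta, pool, used):
    """Index of the not-yet-administered item with the most information at theta.

    `pool` is a list of (a, b, c) tuples; `used` is a set of indices."""
    candidates = [i for i in range(len(pool)) if i not in used]
    if not candidates:
        raise ValueError("item pool exhausted")
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))
```

After each response the ability estimate is updated and `select_item` is called again with the new estimate, so the items administered track the examinee's apparent ability.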

The overall results obtained from the series of studies showed that a 
tailored testing procedure based on the three-parameter logistic model gave
both higher information values and higher reliability coefficients than the 
one-parameter model. The predictive validity of the ability estimate using 
scores on classroom achievement tests as a criterion was found to be about 
equal for the two models. Twenty item tailored tests were found to give
about equivalent reliability to 50 item traditional tests on the same material. 

An important finding of the live testing research on tailored testing was
the determination of the sensitivity of the procedures to the accuracy of the
item calibration information. When item parameter estimates were poor, the 
procedures gave meaningless results, regardless of the quality of the test 
items. Inaccurate parameter estimates also made the information function 
meaningless. High information values were sometimes obtained for tests with 
low reliabilities. These results point out the critical importance of item 
calibration and linking. 

In addition to evaluating the quality of ability estimates using the two 
procedures, research was done to determine the best way to operate the tailored 
testing procedure. This research involved determining the composition of the 
item pool, the appropriate place in the pool to start administering items, and 
the item selection procedure used before ability estimates were available.
The results of this research are reported in the following reports and papers.

Patience, W. M. & Reckase, M. D. Effects of program parameters and item pool
characteristics on the bias of a three-parameter tailored testing
procedure. Paper presented at the meeting of the National Council on
Measurement in Education, Boston, April 1980.

Patience, W. M. & Reckase, M. D. Operational characteristics of a one-parameter
tailored testing procedure (Research Report 79-2). Columbia, MO:
University of Missouri, October 1979.

Patience, W. M. & Reckase, M. D. Operational characteristics of a Rasch model
tailored testing procedure when program parameter and item pool attributes
are varied. Paper presented at the meeting of the National Council on
Measurement in Education, San Francisco, April 1979.

The methodology used for this research was to simulate the testing
process for a hypothetical examinee of known ability and determine the mean 
and standard deviation of the obtained ability estimates. The procedure 
was considered acceptable if the resulting estimates were unbiased and had 
a small variance. An analytic procedure that traced all possible paths 
through the item pool for a person of known ability was also used to determine 
the statistical bias and variance of estimates.



The results of this research indicated that the characteristics of the 
item pool are important in determining the quality of the ability estimates. 
The item pool should have a rectangular distribution of item difficulties, 
and a uniform level of item discrimination. Items with low discrimination 
parameter estimates are not selected by the tailored testing procedure so 
they should not be included in determining the size of the active item pool. 
It was also found to be important that the difficulty scale be uniformly 
covered by items. If gaps in the coverage were present, regions of the 
ability scale would be poorly estimated. 

Other recommendations can be made based on this research. First, the 
initial ability estimate used to start the testing session should be one 
that selects a first item of about median difficulty since it is important
that enough items are present both above and below the initial ability to 
give good estimation. Also, when using a maximum likelihood ability
estimation procedure, the step size used before an estimate is obtained
should be approximately 0.7 for the one-parameter procedure and 0.3 for the
three-parameter procedure. Otherwise, the examinee's ability estimate may move
out of the range where items are present before a good ability estimate can 
be obtained.

The recommendations given above should only be considered as rough
guidelines because of the complex interaction of the variables controlling the
tailored testing situation. The best recourse is to simulate the operation 
of about 50 examinees at numerous points along the ability scale using the 
actual item parameters from the item pool to be used. This procedure can
be used to determine the accuracy of ability estimates that can be obtained. 
The controlling parameters of the tailored testing program can then be fine 
tuned to the item pool. 
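
The kind of simulation recommended above can be sketched as below. Everything here is a simplified stand-in rather than the project's program: the item pool is hypothetical, ability is estimated by a coarse grid search instead of an iterative routine, and the names `simulate_examinee` and `bias_and_sd` are assumptions. The fixed step of 0.3 follows the three-parameter guideline given earlier.

```python
import math
import random
import statistics

D = 1.7  # scaling constant (assumed)
GRID = [g / 10.0 for g in range(-40, 41)]  # candidate abilities, -4.0 to 4.0


def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def information(theta, a, b, c):
    """Item information for the three-parameter logistic model."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def ml_estimate(items, responses):
    """Grid-search maximum likelihood ability estimate."""
    def loglik(theta):
        total = 0.0
        for (a, b, c), u in zip(items, responses):
            p = min(max(p_3pl(theta, a, b, c), 1e-9), 1.0 - 1e-9)
            total += math.log(p if u else 1.0 - p)
        return total
    return max(GRID, key=loglik)


def simulate_examinee(true_theta, pool, test_length, start, step, rng):
    """Administer one simulated tailored test and return the final estimate."""
    theta, used, items, responses = start, set(), [], []
    for _ in range(test_length):
        unused = [j for j in range(len(pool)) if j not in used]
        i = max(unused, key=lambda j: information(theta, *pool[j]))
        used.add(i)
        u = 1 if rng.random() < p_3pl(true_theta, *pool[i]) else 0
        items.append(pool[i])
        responses.append(u)
        if 0 < sum(responses) < len(responses):
            theta = ml_estimate(items, responses)  # mixed pattern: use ML
        else:
            theta += step if u else -step  # fixed step until then
    return theta


def bias_and_sd(true_theta, pool, n_examinees=50, test_length=20,
                start=0.0, step=0.3, seed=0):
    """Bias and standard deviation of estimates for one known ability."""
    rng = random.Random(seed)
    estimates = [simulate_examinee(true_theta, pool, test_length, start, step, rng)
                 for _ in range(n_examinees)]
    return statistics.mean(estimates) - true_theta, statistics.stdev(estimates)


# Hypothetical pool: difficulties evenly spaced from -2 to 2, equal
# discrimination, guessing level 0.2.
pool = [(1.0, -2.0 + 0.1 * k, 0.2) for k in range(41)]
print(bias_and_sd(0.0, pool))
```

Repeating the call at several true abilities, and with different `start` and `step` values, gives a rough picture of where a particular pool estimates well and where it does not.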

Tailored Testing Procedure 

Based upon research reported up to this point, the three-parameter 
logistic model is a clear choice over the one-parameter logistic model for 
tailored testing applications. However, there are two commonly used proce- 
dures for applying the three-parameter model to tailored testing, Owen's 
Bayesian procedure (Owen, 1975) and the maximum likelihood procedure, and 
little work has been done to directly compare the two. A live testing study 
comparing these two procedures was conducted on this project to obtain
information relevant to choosing between them. The results of the study
are given in the following report and paper: 

McKinley, R. L. & Reckase, M. D. A comparison of a Bayesian and a maximum
likelihood tailored testing procedure (Research Report 81-2). Columbia,
MO: University of Missouri, August 1981.

Rosso, M. A. & Reckase, M. D. A comparison of a maximum likelihood and a
Bayesian ability estimation procedure for tailored testing. Paper
presented at the meeting of the National Council on Measurement in
Education, Los Angeles, April 1981.



This study compared the reliability coefficients, information functions
and ability estimates for achievement tests administered using either Owen's
Bayesian procedure or a maximum likelihood tailored testing procedure. The
Bayesian procedure selected items to minimize the posterior variance of the
ability estimates and estimated ability as the mean of the posterior
distribution of ability. The maximum likelihood procedure selected items to
maximize the information function at the most recent ability estimate and
estimated ability using an empirical maximum likelihood approach.

The results of the study showed that the two procedures had approximately 
equal reliabilities and information functions. However, a definite
regression effect was found to be a result of the Bayesian prior. In this
study, the prior was assumed to be normal with a mean near the median
difficulty of the item pool and a variance of 1.0. Since the prior mean was
somewhat lower than the ability of the group tested, the ability estimates
were artificially kept lower than the maximum likelihood estimates. Because
of this effect, the maximum likelihood procedure was recommended if accurate
prior information was not available.
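
The pull of the prior on a posterior mean estimate can be illustrated numerically. The sketch below is not Owen's procedure, which updates a normal approximation item by item; it simply sums the posterior over a grid of ability values. The item parameters, response pattern, D = 1.7, and the function name `posterior_mean` are all assumptions made for the example.

```python
import math

D = 1.7  # scaling constant (assumed)


def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def posterior_mean(items, responses, prior_mean, prior_sd=1.0):
    """Posterior mean of ability under a normal prior, summed over a grid."""
    grid = [g / 50.0 for g in range(-300, 301)]  # -6.0 to 6.0
    num = den = 0.0
    for theta in grid:
        # Unnormalized prior density times the likelihood of the responses.
        w = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(theta, a, b, c)
            w *= p if u else 1.0 - p
        num += theta * w
        den += w
    return num / den


# Hypothetical five-item test and response pattern.
items = [(1.0, b, 0.2) for b in (-1.0, -0.5, 0.0, 0.5, 1.0)]
responses = [1, 1, 1, 0, 1]
print(posterior_mean(items, responses, prior_mean=-0.5))
print(posterior_mean(items, responses, prior_mean=0.5))
```

With the responses held fixed, lowering the prior mean lowers the estimate, which is the regression effect described above when the prior mean sits below the ability of the group tested.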

Decision Making Procedure 

The final project undertaken on this contract was to investigate
decision making procedures for use with tailored testing. The most convenient
procedure found for use with tailored testing was based on Wald's (1947)
sequential probability ratio test (SPRT). Research was done on the contract
to determine the usefulness of such a procedure. The results of the research
effort are presented in the following reports and papers.

Reckase, M. D. The use of the sequential probability ratio test in making
grade classifications in conjunction with tailored testing (Research
Report 81-4). Columbia, MO: University of Missouri, August 1981.

Reckase, M. D. Some decision procedures for use with tailored testing. In
D. J. Weiss (Ed.), Proceedings of the 1979 computerized adaptive testing
conference. Minneapolis: University of Minnesota, 1980.

Reckase, M. D. An application of tailored testing and sequential analysis
to classification problems. Paper presented at the meeting of the
American Educational Research Association, Boston, April 1980.

Reckase, M. D. A generalization of sequential analysis to decision making
with tailored testing. Paper presented at the meeting of the Military
Testing Association, Oklahoma City, OK, November 1978.

The SPRT procedure was investigated for use with tailored testing using 
both simulation and live testing techniques. Use of the SPRT with both the 
one-parameter and three-parameter models was studied. The results showed that
a substantial reduction of test length could be attained through the use of
the SPRT without loss of decision accuracy. As with previous work, the
three-parameter procedure was found to yield better results than the
one-parameter procedure. The detrimental effects of guessing on the operation
of the one-parameter logistic based tailored testing procedure were an
especially important factor in the results.
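
A minimal sketch of how Wald's test can stop a tailored test early is given below. The setup is an assumption made for illustration: two ability values, theta0 just below the cutoff and theta1 just above it, define the hypotheses, and the error rates alpha and beta set Wald's decision bounds. The function name and the example items are hypothetical, and the details of the procedure actually studied are in the reports listed above.

```python
import math

D = 1.7  # scaling constant (assumed)


def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def sprt_classify(items, responses, theta0, theta1, alpha=0.05, beta=0.05):
    """Wald's SPRT on a response string: theta0 (below cutoff) vs theta1 (above).

    Returns the decision and the number of items used to reach it."""
    upper = math.log((1.0 - beta) / alpha)  # classify above at or beyond this
    lower = math.log(beta / (1.0 - alpha))  # classify below at or beyond this
    log_ratio = 0.0
    for n, ((a, b, c), u) in enumerate(zip(items, responses), start=1):
        p0 = p_3pl(theta0, a, b, c)
        p1 = p_3pl(theta1, a, b, c)
        # Add the log likelihood ratio contributed by this response.
        log_ratio += math.log(p1 / p0) if u else math.log((1.0 - p1) / (1.0 - p0))
        if log_ratio >= upper:
            return "above", n
        if log_ratio <= lower:
            return "below", n
    return "undecided", len(responses)


# Ten identical hypothetical items located at a cutoff of 0.0.
items = [(1.0, 0.0, 0.0)] * 10
print(sprt_classify(items, [1] * 10, theta0=-0.5, theta1=0.5))
```

Testing stops as soon as either bound is crossed, which is the source of the savings in test length; a response string that never crosses a bound is left undecided and would need a fallback rule.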



Summary and Conclusions



This research project studied many facets of the application of IRT
and tailored testing to achievement measurement. Included were studies of
techniques for sorting items into unidimensional item sets, calibrating
test items, linking item calibrations, estimating ability, selecting items
for tailored testing, and making decisions using tailored testing. Overall,
the results were fairly positive. Unidimensional item sets can be formed
using the principal factor technique on phi coefficients and the items can
be calibrated for use with tailored testing using the LOGIST program if a
sufficient sample of individuals is available. The calibration of separate
tests can be linked to produce a large item pool if at least 15 items are
in common between the tests and a sample of performance for at least 2000
individuals is available for each test. The maximum likelihood procedure
using the LOGIST program is recommended for the linking. The
three-parameter logistic model has been shown to give an adequate theoretical
basis for tailored testing, even for achievement testing which does not quite
meet the assumptions of the model. A tailored testing model based on
maximum information item selection and maximum likelihood ability estimation
is recommended for use, with an expected result of reducing the number of
items required to obtain a test reliability equal to that of a traditional
test more than twice as long. Accurate decision making has been shown to
be possible using the sequential probability ratio test with tailored testing
with a substantial reduction in the number of test items administered and
tight control of errors of classification.

With all of these positive results, there are still many areas in which
the user of tailored testing must exercise extreme caution. Unlike
traditional paper and pencil testing, tailored testing is critically dependent
on the quality of the calibration and linking of the item pool. If the item
parameters are poorly estimated, the item selection procedure and ability
estimation procedure will be operating on meaningless numbers and will tend
to give meaningless results. The situation is equivalent to determining
the length of a line with a ruler that has its units marked off in the wrong
places. Trusting the calibration too much can give test results that look
good (i.e., have high information functions), when the reliability of the
scores is in fact very low. As a result, tailored tests should still be
evaluated for quality using procedures independent of the item parameters,
such as test-retest reliability.

The use of the three-parameter logistic model as a basis for ability 
estimation causes some subtle problems in test score interpretation. Using 
this model causes item responses to be weighted by the discrimination
parameter estimates when computing an estimate of ability. This gives high
weight to those items measuring the major component of a test and very low
weight to items not measuring that component. The result is an ability estimate
measuring a trait that is more unidimensional than is obtained from a number 
correct score. This difference must be taken into account when making use 
of tailored testing.

The use of tailored testing for measurement is similar in many ways
to the use of computers for computation. The techniques give high power 
and efficiency based on high technology. But a price is paid for the ad- 
vantages. The price is a greater sensitivity to the input to the proce- 
dure and a greater dependence on complicated hardware. The results of 
this research contract definitely show that tailored testing can be applied 
to achievement measurement with many advantages. However, the technique 
cannot be applied carelessly and still achieve those advantages. 